Multivariate data visualization

Lecture 4

732A98

Multivariate data visualization

Continuous variables are involved in the following plot types:

  • Parallel coordinate plots
  • Heatmaps
  • Star (radar) charts

Parallel coordinates

Construction:

  • Vertical axis: Values
  • Horizontal axis: Variables
  • 1 trace line = 1 observation

Analysis:

  • Clusters
  • Outliers
  • Correlated variables
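The construction above can be sketched in a few lines of Python. This is a minimal illustration, not a plotting routine: the function name `parallel_coordinates`, the min-max scaling of each axis to [0, 1], and the toy data are assumptions made for the example.

```python
def parallel_coordinates(rows):
    """Return one polyline per observation as a list of (axis_index, scaled_value)."""
    columns = list(zip(*rows))  # column-wise view of the data
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    traces = []
    for row in rows:
        trace = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            # Assumption: min-max scaling so that all axes share the range [0, 1];
            # a constant column is drawn at mid-height.
            y = (value - lows[j]) / span if span else 0.5
            trace.append((j, y))
        traces.append(trace)
    return traces


# Hypothetical toy data: 3 observations (rows), 3 variables (columns).
data = [
    [5.0, 3.0, 1.0],
    [6.0, 2.0, 4.0],
    [7.0, 4.0, 7.0],
]
traces = parallel_coordinates(data)
for trace in traces:
    print(trace)
```

Each returned list is one trace line: its x-coordinates are the variable positions on the horizontal axis and its y-coordinates are the scaled values.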

Parallel coordinates

Analysis
  • Positive correlation between two adjacent variables: almost all segments are parallel to each other
  • Clusters in some variable space: several trace lines that are near each other and have a similar pattern
  • Outliers: trace lines that have an unusual pattern and/or fall outside the common plot area
Problems
  • Trace lines overlap each other -> difficult to find patterns, difficult to follow a specific trace line
  • Analysis depends strongly on the order of variables (correlation, clusters) -> a proper reordering may improve the analysis
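One way to see why adjacent positively related variables give a tidy picture while others give clutter is to count how many pairs of trace segments cross between two neighbouring axes. The sketch below is only an illustration; the function name `count_crossings` and the toy data are assumptions, and crossing counts are just one possible clutter indicator.

```python
from itertools import combinations


def count_crossings(left, right):
    """Count pairs of trace segments that cross between two adjacent axes."""
    crossings = 0
    for a, b in combinations(range(len(left)), 2):
        # Two segments cross when the two observations swap their
        # vertical order between the left and the right axis.
        if (left[a] - left[b]) * (right[a] - right[b]) < 0:
            crossings += 1
    return crossings


# Hypothetical toy data: y_pos increases with x, y_neg decreases with x.
x = [1.0, 2.0, 3.0, 4.0]
y_pos = [2.0, 4.0, 6.0, 8.0]
y_neg = [8.0, 6.0, 4.0, 2.0]

print(count_crossings(x, y_pos))
print(count_crossings(x, y_neg))
```

With the increasing relation no segments cross, while with the decreasing relation every pair of segments crosses, which is the overlap problem described above.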

Parallel coordinates

Example: Iris dataset - How many clusters do you see?

[Figure: parallel coordinate plot of the Iris data; axes: Sepal Width, Sepal Length, Petal Length]

Parallel coordinates

  • Sometimes clusters overlap with categories given by some variable
    • Non-mixing groups are not the same as clusters!
[Figure: parallel coordinate plot of the Iris data; axes: Sepal Width, Sepal Length, Petal Length]

Ordering problem

  • Problem of ordering (variables, observations) is one of the key problems in multidimensional visualization
    • Sometimes has a huge impact on perception (heatmaps)
  • A lot of approaches exist

Problem formulation: Given a data set $\chi = (x_{ij} \mid i = 1, \dots, n,\ j = 1, \dots, p)$

  • Select an order $\Psi = (i_1, \dots, i_p)$ that optimizes visual perception (analysis) -> this defines a reordering of the data columns $\Psi: \chi \to \chi'$

Note: p! possible orderings exist…
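To get a feeling for how quickly p! grows, the following sketch prints the number of orderings for a few values of p. The chosen values of p are illustrative assumptions.

```python
import math

# Number of possible column orderings for a few illustrative values of p.
orderings = {p: math.factorial(p) for p in (4, 7, 11, 20)}

for p, count in orderings.items():
    print(f"p = {p:2d}: {count} orderings")
```

Already for a few dozen variables, trying every ordering is out of reach, which motivates the heuristic approaches.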

Ordering problem

Solution
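  • Early approaches (e.g. Ankerst et al. 1998):
  1. Choose a distance (proximity) matrix $D = \{d_{ij} = d(x_i, x_j)\}$ between variables (columns), e.g.
    • Euclidean distance on scaled columns
    • 1 − correlation
  2. This defines a graph with vertices $1, \dots, p$ and edge weights $d_{ij}$ -> find the shortest Hamiltonian path (Traveling Salesman Problem)

$\min_{\Psi} \sum_{j=1}^{p-1} d'_{j,j+1}$

where $d'_{j,j+1}$ is the distance between adjacent columns $j$ and $j+1$ of the reordered data $\chi'$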

  • TSP is NP-hard -> approximate solutions are used
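The two steps can be sketched as follows: build a 1 − correlation distance matrix between columns, then approximate the shortest Hamiltonian path greedily. The nearest-neighbour heuristic is a stand-in chosen for brevity, not necessarily the method of Ankerst et al.; the function names and the toy columns are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def distance_matrix(columns):
    """Step 1: d_ij = 1 - correlation between columns i and j."""
    p = len(columns)
    return [[1.0 - pearson(columns[i], columns[j]) for j in range(p)]
            for i in range(p)]


def nearest_neighbour_order(dist, start=0):
    """Step 2 (approximate): always move to the closest unvisited column."""
    order = [start]
    remaining = set(range(len(dist))) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical toy columns: c is nearly identical to a, b is a reversed.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [5.0, 4.0, 3.0, 2.0, 1.0]
c = [1.1, 2.0, 3.2, 3.9, 5.1]

dist = distance_matrix([a, b, c])
order = nearest_neighbour_order(dist)
print(order)
```

The greedy path places the strongly correlated columns a and c next to each other. The result depends on the start column, so in practice several starts or a better TSP heuristic would be tried.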

Ordering problem

Solution: modern approaches

  • Based on:
    • Decreasing visual clutter
    • Clustering data points/dimensions
    • Outlier detection
    • Dimensionality reduction (for ex. MDS)
    • …
  • Note: most of these can be applied both for ordering observations and ordering variables
    • Just transpose the data matrix…

Ordering problem

Objective functions:

  • Gradient measures (anti-Robinson)
  • Hamiltonian path length
  • Least squares
  • …

They are based on $\min_{\Psi} L(\Psi(D))$

Optimization algorithms:

  • Partial enumeration
  • Traveling salesman solvers
  • Hierarchical clustering
  • …

Gradient measures

Aim: distances should increase when moving away from the diagonal

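$d_{ik} \le d_{ij}$ for $1 \le i < k < j \le n$
$d_{kj} \le d_{ij}$ for $1 \le i < k < j \le n$
Objective function:

$L(D) = \sum_{i<k<j} \left[ f(d_{ik}, d_{ij}) + f(d_{kj}, d_{ij}) \right]$
where
$f(z, y) = \mathrm{sign}(z - y)$ or $f(z, y) = z - y$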
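The objective can be evaluated directly from its definition. The sketch below follows the formula as stated on this slide (smaller is better, in line with the minimization used for all objectives here); the function names and the toy distance matrix are assumptions.

```python
from itertools import combinations


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def gradient_measure(dist, order, weighted=False):
    """Evaluate L(D) for the distance matrix permuted by `order`."""
    d = [[dist[a][b] for b in order] for a in order]
    total = 0
    for i, k, j in combinations(range(len(order)), 3):  # yields i < k < j
        for z in (d[i][k], d[k][j]):
            diff = z - d[i][j]
            # f(z, y) = z - y (weighted) or sign(z - y) (unweighted)
            total += diff if weighted else sign(diff)
    return total


# Hypothetical toy data: four items at positions 0, 1, 2, 3 on a line.
positions = [0, 1, 2, 3]
dist = [[abs(a - b) for b in positions] for a in positions]

good = gradient_measure(dist, [0, 1, 2, 3])
bad = gradient_measure(dist, [0, 2, 1, 3])
print(good, bad)
```

The natural order satisfies every inequality and therefore gets the smaller value; swapping two items violates some of the inequalities and increases the objective.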

Other objectives

Hamiltonian path length:

$L(D) = \sum_{i=1}^{n-1} d_{i,i+1}$

Least squares criterion (PCA)

  • Solution is similar to first PCA component

$L(D) = \sum_{i} \sum_{j} (d_{ij} - |i - j|)^2$
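Both objectives are straightforward to compute for a given ordering. The following sketch evaluates them on a small example; the function names and the toy distance matrix are assumptions.

```python
def path_length(dist, order):
    """Hamiltonian path length: sum of distances between neighbours in `order`."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def least_squares(dist, order):
    """Least squares criterion: squared deviation of d_ij from the rank distance |i - j|."""
    n = len(order)
    return sum((dist[order[i]][order[j]] - abs(i - j)) ** 2
               for i in range(n) for j in range(n))


# Hypothetical toy data: four items at positions 0, 1, 2, 3 on a line.
positions = [0, 1, 2, 3]
dist = [[abs(a - b) for b in positions] for a in positions]

for order in ([0, 1, 2, 3], [0, 2, 1, 3]):
    print(order, path_length(dist, order), least_squares(dist, order))
```

For this toy matrix the natural order minimizes both criteria: its path length is the smallest possible and its distances match the rank distances exactly.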

Optimization algorithms

Partial enumeration methods

  • E.g. branch and bound, dynamic programming
  • Constructing solutions by parts

TSP solver

  • Suitable for the Hamiltonian path objective
  • Find shortest path by dynamic programming or heuristics
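For small problems, the shortest Hamiltonian path can be found exactly by dynamic programming over subsets of vertices (in the style of the Held-Karp algorithm). The sketch below is one possible formulation under that assumption; the function name and the toy data are hypothetical, and the memory use grows exponentially with the number of vertices.

```python
from itertools import combinations


def shortest_hamiltonian_path(dist):
    """Return (length, order) of a shortest path visiting every vertex once."""
    n = len(dist)
    # best[(mask, last)] = (length, predecessor) of the shortest path that
    # visits exactly the vertices in bitmask `mask` and ends at `last`.
    best = {(1 << v, v): (0, None) for v in range(n)}
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            mask = sum(1 << v for v in subset)
            for last in subset:
                prev_mask = mask ^ (1 << last)
                best[(mask, last)] = min(
                    (best[(prev_mask, prev)][0] + dist[prev][last], prev)
                    for prev in subset if prev != last
                )
    full = (1 << n) - 1
    length, last = min((best[(full, v)][0], v) for v in range(n))
    # Walk the stored predecessors backwards to recover the order.
    order, mask = [], full
    while last is not None:
        order.append(last)
        mask, last = mask ^ (1 << last), best[(mask, last)][1]
    return length, order[::-1]


# Hypothetical toy data: vertices 0..3 located at positions 0, 2, 1, 3.
positions = [0, 2, 1, 3]
dist = [[abs(a - b) for b in positions] for a in positions]

length, order = shortest_hamiltonian_path(dist)
print(length, order)
```

The returned order sorts the vertices by position (in one of the two directions), which is the shortest possible path for points on a line.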

Optimization algorithms

Hierarchical clustering

  • Observations are joined into clusters
  • Clusters are joined into larger clusters
  • This continues until only one cluster is left
  • Leaves and branches are permuted to minimize the given objective
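A simplified version of this idea is sketched below: clusters are merged by single linkage, and at each merge the two branches may be flipped so that the leaves meeting at the junction are as close as possible. Both the linkage and the local flipping rule are assumptions made to keep the example short; dedicated optimal leaf ordering algorithms optimize a global objective instead.

```python
def hierarchical_order(dist):
    """Order items by agglomerative clustering with locally flipped branches."""
    clusters = [[i] for i in range(len(dist))]  # each cluster keeps its leaf order
    while len(clusters) > 1:
        # Single linkage: merge the two clusters with the closest pair of members.
        _, a, b = min(
            (min(dist[u][v] for u in clusters[a] for v in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        left, right = clusters[a], clusters[b]
        # Try both orientations of each branch and keep the arrangement
        # whose two junction leaves are closest to each other.
        candidates = [l + r for l in (left, left[::-1]) for r in (right, right[::-1])]
        merged = min(candidates, key=lambda c: dist[c[len(left) - 1]][c[len(left)]])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters[0]


# Hypothetical toy data: items 0..3 located at positions 0, 10, 1, 11.
positions = [0, 10, 1, 11]
dist = [[abs(p - q) for q in positions] for p in positions]

order = hierarchical_order(dist)
print(order)
```

The two tight groups (items 0 and 2, items 1 and 3) end up adjacent, and the final flip places the closest members of the two groups next to each other.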

Effect of ordering

[Figure: two parallel coordinate plots of the scaled car data. First variable order: mpg, cyl, disp, hp, drat, wt, qsec. Second variable order: mpg, drat, qsec, hp, cyl, disp, wt.]

Heatmaps

A heatmap visualizes an n × m matrix

  • Normally rows = observations, columns = parameters
  • The heatmap has the corresponding size
  • Each cell of the matrix corresponds to a cell in the heatmap
  • High values correspond to intense colors in this map (or vice versa for other color schemes!)
  • Names of variables and observations are shown
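The mapping from matrix cells to color intensities can be illustrated with a text-only heatmap in which denser characters stand for more intense colors. The character ramp `SHADES`, the column-wise min-max scaling, and the toy matrix are assumptions for this sketch.

```python
SHADES = " .:-=+*#%@"  # assumed intensity ramp, from low to high


def text_heatmap(matrix):
    """Return one string per row; each character encodes one cell's intensity."""
    columns = list(zip(*matrix))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    lines = []
    for row in matrix:
        cells = []
        for j, value in enumerate(row):
            span = highs[j] - lows[j]
            # Assumption: each column is scaled separately to [0, 1].
            level = (value - lows[j]) / span if span else 0.0
            cells.append(SHADES[round(level * (len(SHADES) - 1))])
        lines.append("".join(cells))
    return lines


# Hypothetical toy matrix: 3 observations (rows), 2 parameters (columns).
matrix = [
    [1.0, 10.0],
    [2.0, 30.0],
    [3.0, 20.0],
]
lines = text_heatmap(matrix)
for line in lines:
    print("|" + line + "|")
```

The output has the same shape as the matrix, one character per cell. Scaling per column makes parameters with different units comparable, at the cost of hiding absolute differences between columns.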

Heatmaps

[Figure: heatmap of the car data. Columns: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb. Rows: car models (Mazda RX4, Datsun 710, Hornet Sportabout, Duster 360, among others). Color legend ticks: 0, 2.]

Heatmaps

Analysis:

  • Compare the values of one parameter across observations (down a column)
  • Compare the values of a single observation across parameters (along a row)
  • Compare the patterns for different rows or columns
  • Find similar observations (areas with the same color intensity)
  • Find which variables define similarity for a group
  • Find correlated variables (columns with a similar pattern)

Heatmaps

  • Exercise (last picture):
    • How many clusters do you see?
    • Which variables define clusters?
    • Which variables are correlated?

Effect of reordering

  • Gradient measure objective used
  • See new analysis possibilities
[Figure: reordered heatmap of the car data. Columns: mpg, drat, qsec, hp, cyl, disp, wt. Rows: car models (Lincoln Continental, Chrysler Imperial, Ford Pantera L, Duster 360, among others). Color legend ticks: −1, 0, 1, 2.]

Radar charts

  • Use polar coordinate system
  • Map each column value to a coordinate along its own direction (axis)
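The polar mapping can be sketched as follows: variable j out of p gets the angle 2πj/p, and the value of the variable is used as the radius. The function name `radar_vertices` and the assumption that values are already scaled to be non-negative are choices made for this example.

```python
import math


def radar_vertices(values):
    """Return the (x, y) polygon vertices of one observation's radar chart."""
    p = len(values)
    points = []
    for j, radius in enumerate(values):
        angle = 2 * math.pi * j / p  # evenly spaced directions, one per variable
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


# Hypothetical observation with four variables, already scaled to [0, 1].
observation = [1.0, 0.5, 1.0, 0.5]
points = radar_vertices(observation)
for x, y in points:
    print(round(x, 3), round(y, 3))
```

Connecting consecutive vertices (and the last one back to the first) gives the polygon drawn for this observation.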

Radar charts

If juxtaposed, analyse:

  • Clusters
  • Outliers
  • Outlying directions

If superimposed, analyse:

  • Comparing variable length
  • Seeing similar and outlying observations

Radar charts

[Figure: radar charts of the car data for Mazda RX4, Mazda RX4 Wag and Datsun 710. Axes: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb. Radial scale: −1 to 1.]

Radar charts

Problems:
  • Difficult to judge orientations
  • Number of dimensions and observations is very limited
    • Number of observations is extremely limited if superimposed
  • Radar charts placed closer together are easier to compare
  • Perception is much affected by observation ordering

Ordering:

  • Same as before plus
  • Dimensions can be sorted to promote more symmetric charts

Radar charts

  • Now with reordering by Gradient Measures

Radar charts

Other positioning possible - PCA/MDS

Trellis plots / facets

Idea:

  1. Make same kind of plot for subsets of data
  2. Plot together
  3. See patterns/differences

Analogy: cutting a sausage
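The three steps of the idea can be sketched without any plotting: split the records by a conditioning variable, apply the same computation to every subset, and compare the results. In a real trellis plot the per-panel computation would be a plot; here a mean stands in for it. The function names and the toy records are hypothetical and are not the barley data.

```python
from collections import defaultdict


def facet(records, by):
    """Step 1: split the records into panels, one per value of `by`."""
    panels = defaultdict(list)
    for record in records:
        panels[record[by]].append(record)
    return dict(panels)


def panel_means(panels, variable):
    """Step 2: apply the same summary to every panel."""
    return {key: sum(r[variable] for r in rows) / len(rows)
            for key, rows in panels.items()}


# Hypothetical toy records.
records = [
    {"site": "A", "year": 1, "yield": 30.0},
    {"site": "A", "year": 2, "yield": 34.0},
    {"site": "B", "year": 1, "yield": 40.0},
    {"site": "B", "year": 2, "yield": 28.0},
]

panels = facet(records, by="site")
means = panel_means(panels, "yield")
# Step 3: compare the panels side by side.
for site, mean in means.items():
    print(site, mean)
```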

Trellis plots / facets

  • Example: Barley data
    • Anything strange?

Trellis plots / facets

  • Faceting = one more aesthetic

  • What can be analysed?
    • Patterns within/between plots
    • Conditional dependence Y∼X|Z
    • Variable interaction, additivity

-> Useful tool for modeling!

  • Compare: 3D scatter plots

Trellis plots / facets

  • Another car data set: is there additivity?

Trellis plots / facets

  • Design issues:
    • How to order rows/columns in trellis?
      • A: X = one variable, Y = another variable (facet_grid)
      • B: independently of aes (facet_wrap)
    • How to handle categorical vars?
      • One value/panel
      • Group
      • Ordering? (R: decide factor levels)
    • How to handle real-valued vars?
      • Split equal size/length
      • Shingles

Shingles

  • Creates overlap
  • To avoid boundary effects

Example: AIDS data (Age, Time of Death, Time of Diag)
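Overlapping intervals for a real-valued conditioning variable can be constructed as in the sketch below. It builds equal-length intervals with a fixed overlap fraction; this is an illustrative assumption and not the exact algorithm of any particular package (equal-count shingles, for instance, choose the interval limits from the data instead). The function names and the toy values are hypothetical.

```python
def shingle_intervals(lo, hi, number, overlap=0.5):
    """Return `number` equal-length intervals covering [lo, hi].

    Neighbouring intervals share the fraction `overlap` of their length.
    """
    width = (hi - lo) / (1 + (number - 1) * (1 - overlap))
    step = width * (1 - overlap)
    return [(lo + k * step, lo + k * step + width) for k in range(number)]


def assign_to_shingles(values, intervals):
    """A value in an overlap region is assigned to every interval containing it."""
    return [[v for v in values if a <= v <= b] for a, b in intervals]


# Hypothetical conditioning variable.
values = [1.0, 3.0, 6.0, 9.0]
intervals = shingle_intervals(0.0, 10.0, number=3, overlap=0.5)
groups = assign_to_shingles(values, intervals)
for interval, group in zip(intervals, groups):
    print(interval, group)
```

Because neighbouring intervals overlap, observations near a boundary appear in two panels, which softens the effect of where the boundaries happen to fall.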

Shingles

  • AIDS data: conclusions?

Read at home

  • Chapter 5

  • Paper "Hahsler, M., Hornik, K., & Buchta, C. (2008). Getting things in order: an introduction to the R package seriation. Journal of Statistical Software, 25(3), 1-34".

  • (Browse through) paper "Ankerst, M., Berchtold, S., & Keim, D. A. (1998). Similarity clustering of dimensions for an enhanced visualization of multidimensional data. In Proceedings of the IEEE Symposium on Information Visualization (pp. 52-60). IEEE".

  • Becker, R. A., Cleveland, W. S., & Shyu, M. J. (1996). The visual design and control of trellis display. Journal of Computational and Graphical Statistics, 5(2), 123-155.